Ray Serve


Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

Yang, Yuting, Yuan, Tiancheng, Hashim, Jamal, Garrett, Thiago, Qian, Jeffrey, Zhang, Ann, Wang, Yifan, Song, Weijia, Birman, Ken

arXiv.org Artificial Intelligence

There is growing interest in deploying ML inference and knowledge retrieval as services that can support both interactive queries from end users and the more demanding request flows that arise when AIs are integrated into end-user applications or deployed as agents. Our central premise is that these latter cases will bring service-level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing requests to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex's pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often sustaining a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
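The batching-versus-tail-latency tradeoff the abstract points to can be made concrete with a toy queueing sketch. This is not from the paper: the batch size, arrival rate, and execution time below are arbitrary assumptions. It shows that when a server dispatches only full batches, a request that arrives early in a batch waits for the batch to fill, so its latency depends on the arrival process rather than on the model.

```python
# Toy illustration of batching-induced tail latency (not from the Vortex
# paper). A request's latency = time until its batch fills + execution time.
import random

BATCH_SIZE = 8        # assumed: server dispatches only full batches
MEAN_GAP_MS = 5.0     # assumed mean inter-arrival gap (exponential)
BATCH_EXEC_MS = 20.0  # assumed fixed batch execution time

random.seed(0)
latencies, t, arrivals = [], 0.0, []
for _ in range(10_000):
    t += random.expovariate(1.0 / MEAN_GAP_MS)
    arrivals.append(t)
    if len(arrivals) == BATCH_SIZE:
        done = t + BATCH_EXEC_MS  # batch dispatches only when full
        latencies += [done - a for a in arrivals]
        arrivals = []

latencies.sort()
print(f"p50 = {latencies[len(latencies) // 2]:.1f} ms")
print(f"p99 = {latencies[int(len(latencies) * 0.99)]:.1f} ms")
```

The gap between p50 and p99 comes entirely from variance in how long each batch takes to fill, which is the kind of unpredictability an SLO-first design has to control.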


ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

Fu, Yao, Xue, Leyang, Huang, Yeqi, Brabete, Andrei-Octavian, Ustiugov, Dmitrii, Patel, Yuvraj, Mai, Luo

arXiv.org Artificial Intelligence

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art serverless systems.

Furthermore, LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77] due to iterative output token generation. To achieve low latency, processing an LLM request often necessitates the use of several GPUs for durations ranging from seconds to minutes. In practice, LLM service providers need to host a large number of LLMs catered to different developers, leading to significant GPU consumption [15] and impeding the sustainability of LLM services [19]. As a result, LLM inference services have to impose strict caps on the number of requests sent to their services by their users (e.g., 40 messages per 3 hours for ChatGPT [51]), showing the providers' current inability to satisfy LLM inference demand. Researchers [19] project that LLM inference costs may increase by more than 50x when it reaches the popularity of Google Search. To reduce GPU consumption, LLM service providers are exploring serverless inference, as seen in systems like Amazon SageMaker [60], Azure [46], KServe [11] and HuggingFace [31].
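The third contribution, locality-aware server allocation, lends itself to a small illustration. The Python sketch below is not the paper's code: it assumes a simple cost model in which startup time is queueing delay plus checkpoint-load time, with made-up per-tier bandwidths, and picks the server with the smallest estimate.

```python
# Hedged sketch of locality-aware allocation (not ServerlessLLM's actual
# algorithm): estimate startup time per server from where the checkpoint
# resides, then pick the fastest. All numbers are illustrative assumptions.
from dataclasses import dataclass

TIER_BANDWIDTH_GBPS = {"dram": 25.0, "local_ssd": 5.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # which tier holds this model's checkpoint
    queue_delay_s: float   # estimated wait for a free GPU on this server

def estimated_startup_s(server: Server, model_size_gb: float) -> float:
    load_s = model_size_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + load_s

def pick_server(servers: list[Server], model_size_gb: float) -> Server:
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))

servers = [
    Server("gpu-1", "remote", queue_delay_s=0.0),     # idle but cold
    Server("gpu-2", "local_ssd", queue_delay_s=2.0),  # busy but warm
]
best = pick_server(servers, model_size_gb=14.0)  # e.g., a 7B model in fp16
print(best.name)  # gpu-2: 2.0 + 14/5 = 4.8 s beats gpu-1: 14/1 = 14 s
```

The point of the sketch is the ordering it captures: waiting briefly for a server with a warm checkpoint can beat an idle server that must fetch the checkpoint remotely.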


Serving ML Models in Production: Common Patterns - KDnuggets

#artificialintelligence

This post is based on Simon Mo's "Patterns of Machine Learning in Production" talk from Ray Summit 2021. Over the past couple of years, we've listened to ML practitioners across many different industries to learn from and improve the tooling around ML production use cases. Through this, we've seen four common patterns of machine learning in production: pipeline, ensemble, business logic, and online learning. In the ML serving space, implementing these patterns typically involves a tradeoff between ease of development and production readiness. Ray Serve was built to support these patterns by being both easy to develop with and production-ready. It is a scalable and programmable serving framework built on top of Ray to help you scale your microservices and ML models in production.
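As a concrete illustration of the pipeline pattern, here is a minimal Ray Serve sketch. It uses the Ray 2.x deployment-composition API (which differs from the 2021-era API the talk used), and the deployment names and toy "model" logic are invented for illustration.

```python
# Minimal sketch of the Ray Serve "pipeline" pattern: a preprocessing
# deployment chained into a model deployment. Names and logic are placeholders.
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Preprocessor:
    def process(self, text: str) -> str:
        return text.strip().lower()

@serve.deployment
class Classifier:
    def __init__(self, preprocessor: DeploymentHandle):
        self.preprocessor = preprocessor

    async def __call__(self, request) -> dict:
        text = (await request.json())["text"]
        cleaned = await self.preprocessor.process.remote(text)
        label = "positive" if "good" in cleaned else "negative"  # toy model
        return {"label": label}

app = Classifier.bind(Preprocessor.bind())
# serve.run(app)  # then POST {"text": "..."} to http://localhost:8000/
```

Each stage scales independently as its own deployment, which is what distinguishes this pattern from wrapping the whole pipeline in a single service.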


How the Integrations Between Ray & MLflow Aid Distributed ML Production

#artificialintelligence

In this blog post, we're announcing two new integrations between Ray and MLflow: Ray Tune + MLflow Tracking and Ray Serve + MLflow Models, which together make it much easier to build machine learning (ML) models and take them to production. These integrations are available in the latest Ray wheels. You can follow the instructions here to pip install the nightly version of Ray and take a look at the documentation to get started. They will also be in the next Ray release, version 1.2. Our goal is to leverage the strengths of the two projects: Ray's distributed libraries for scaling training and serving, and MLflow's end-to-end model lifecycle management. Let's first take a brief look at what these libraries can do before diving into the new integrations.
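For a sense of what the Ray Tune + MLflow Tracking integration looks like in code, here is a minimal sketch. It follows the Ray 1.2-era module path (ray.tune.integration.mlflow; later releases moved the callback to ray.air.integrations.mlflow), and the training function is a toy placeholder.

```python
# Hedged sketch of Ray Tune logging results to MLflow via a callback.
# The objective function is a stand-in; replace with real training code.
from ray import tune
from ray.tune.integration.mlflow import MLflowLoggerCallback

def train_fn(config):
    score = config["lr"] * 100  # toy objective, not a real model
    tune.report(mean_accuracy=score)  # each report becomes an MLflow metric

tune.run(
    train_fn,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    callbacks=[MLflowLoggerCallback(experiment_name="tune_example")],
)
```

Each Tune trial shows up as its own MLflow run, so hyperparameters and metrics from the whole sweep land in one tracked experiment.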